Classification of ApoE4 and AD status from metabolites

Preparation

Fitting a benchmark model: clinical characteristics only

Multi-class area under the curve: 0.814

Confusion Matrix and Statistics

          Reference
Prediction  AD ADE4 SCD SCDE4
     AD    474  227   0    15
     ADE4  378   91   3    10
     SCD     2   18 193   419
     SCDE4   6    4 204   426

Overall Statistics
                                          
               Accuracy : 0.4794          
                 95% CI : (0.4595, 0.4993)
    No Information Rate : 0.3522          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.296           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: AD Class: ADE4 Class: SCD Class: SCDE4
Sensitivity             0.5512     0.26765    0.48250       0.4897
Specificity             0.8497     0.81643    0.78792       0.8662
Pos Pred Value          0.6620     0.18880    0.30538       0.6656
Neg Pred Value          0.7799     0.87475    0.88738       0.7574
Precision               0.6620     0.18880    0.30538       0.6656
Recall                  0.5512     0.26765    0.48250       0.4897
F1                      0.6015     0.22141    0.37403       0.5642
Prevalence              0.3482     0.13765    0.16194       0.3522
Detection Rate          0.1919     0.03684    0.07814       0.1725
Detection Prevalence    0.2899     0.19514    0.25587       0.2591
Balanced Accuracy       0.7004     0.54204    0.63521       0.6780
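
As a check on how the summary statistics follow from the table, the sketch below recomputes accuracy, Cohen's kappa and a few per-class statistics directly from the benchmark confusion matrix printed above. It is plain Python, not the original analysis code (which this report does not show); the names `labels`, `cm` and `summarize` are illustrative assumptions.

```python
# Benchmark confusion matrix copied from the output above.
# Rows are predicted classes, columns are reference classes.
labels = ["AD", "ADE4", "SCD", "SCDE4"]
cm = [
    [474, 227,   0,  15],
    [378,  91,   3,  10],
    [  2,  18, 193, 419],
    [  6,   4, 204, 426],
]


def summarize(cm):
    """Return overall and per-class statistics for a square confusion matrix."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]                               # predicted counts
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]    # reference counts

    accuracy = sum(cm[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column margins.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)

    per_class = []
    for i in range(k):
        tp = cm[i][i]
        tn = n - row_tot[i] - col_tot[i] + tp
        sensitivity = tp / col_tot[i]
        specificity = tn / (n - col_tot[i])
        precision = tp / row_tot[i]
        per_class.append({
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": precision,
            "f1": 2 * precision * sensitivity / (precision + sensitivity),
            "balanced_accuracy": (sensitivity + specificity) / 2,
        })
    return {"accuracy": accuracy, "kappa": kappa, "per_class": per_class}


stats = summarize(cm)
print(f"Accuracy: {stats['accuracy']:.4f}  Kappa: {stats['kappa']:.4f}")
for name, s in zip(labels, stats["per_class"]):
    print(name, {key: round(value, 4) for key, value in s.items()})
```

The same function can be applied to any of the other confusion matrices in this report by replacing `cm`.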

Regularized Multinomial Regression adding 230 metabolites

Multi-class area under the curve: 0.8365

Confusion Matrix and Statistics

          Reference
Prediction  AD ADE4 SCD SCDE4
     AD    576  231   0     0
     ADE4  284  101   0     8
     SCD     0    0 174   271
     SCDE4   0    8 226   591

Overall Statistics
                                          
               Accuracy : 0.5838          
                 95% CI : (0.5641, 0.6033)
    No Information Rate : 0.3522          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.42            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: AD Class: ADE4 Class: SCD Class: SCDE4
Sensitivity             0.6698     0.29706    0.43500       0.6793
Specificity             0.8565     0.86291    0.86908       0.8538
Pos Pred Value          0.7138     0.25700    0.39101       0.7164
Neg Pred Value          0.8292     0.88493    0.88840       0.8304
Precision               0.7138     0.25700    0.39101       0.7164
Recall                  0.6698     0.29706    0.43500       0.6793
F1                      0.6911     0.27558    0.41183       0.6973
Prevalence              0.3482     0.13765    0.16194       0.3522
Detection Rate          0.2332     0.04089    0.07045       0.2393
Detection Prevalence    0.3267     0.15911    0.18016       0.3340
Balanced Accuracy       0.7631     0.57998    0.65204       0.7665
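
The NA reported for McNemar's test in this model most likely reflects empty cell pairs rather than a problem with the fit. The sketch below assumes the reported value comes from a generalized McNemar (Bowker symmetry) test, as R's `mcnemar.test` computes for tables larger than 2x2: each pair of mirrored off-diagonal cells (a, b) contributes (a - b)^2 / (a + b) to the statistic, which is undefined when both cells are 0. In the table above that happens for AD/SCD, AD/SCDE4 and ADE4/SCD. The function name `symmetry_statistic` is an illustrative assumption, and only the statistic (not the p-value) is computed.

```python
import math


def symmetry_statistic(cm):
    """Bowker-type symmetry statistic and its degrees of freedom.

    Returns NaN for the statistic when any mirrored pair of
    off-diagonal cells is 0/0, mimicking the undefined case.
    """
    k = len(cm)
    df = k * (k - 1) // 2
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            a, b = cm[i][j], cm[j][i]
            if a + b == 0:
                return math.nan, df
            total += (a - b) ** 2 / (a + b)
    return total, df


# Matrices copied from the report (rows: prediction, columns: reference).
benchmark = [
    [474, 227,   0,  15],
    [378,  91,   3,  10],
    [  2,  18, 193, 419],
    [  6,   4, 204, 426],
]
regularized = [
    [576, 231,   0,   0],
    [284, 101,   0,   8],
    [  0,   0, 174, 271],
    [  0,   8, 226, 591],
]

stat_benchmark, df = symmetry_statistic(benchmark)
stat_regularized, _ = symmetry_statistic(regularized)
print(f"Benchmark:   statistic = {stat_benchmark:.2f} on {df} df")
print(f"Regularized: statistic = {stat_regularized} on {df} df")
```

Under this assumption the benchmark model yields a large, well-defined statistic, consistent with its very small reported p-value, while the regularized model yields an undefined one.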

Projection to Latent Factors

Multinomial Logistic Regression adding 6 factors

Multi-class area under the curve: 0.8205

Confusion Matrix and Statistics

          Reference
Prediction  AD ADE4 SCD SCDE4
     AD    516  206   0    10
     ADE4  320   98   7    15
     SCD    13   22 200   325
     SCDE4  11   14 193   520

Overall Statistics
                                          
               Accuracy : 0.5401          
                 95% CI : (0.5202, 0.5599)
    No Information Rate : 0.3522          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3703          
                                          
 Mcnemar's Test P-Value : 5.264e-15       

Statistics by Class:

                     Class: AD Class: ADE4 Class: SCD Class: SCDE4
Sensitivity             0.6000     0.28824    0.50000       0.5977
Specificity             0.8658     0.83944    0.82609       0.8638
Pos Pred Value          0.7049     0.22273    0.35714       0.7046
Neg Pred Value          0.8021     0.88079    0.89529       0.7979
Precision               0.7049     0.22273    0.35714       0.7046
Recall                  0.6000     0.28824    0.50000       0.5977
F1                      0.6482     0.25128    0.41667       0.6468
Prevalence              0.3482     0.13765    0.16194       0.3522
Detection Rate          0.2089     0.03968    0.08097       0.2105
Detection Prevalence    0.2964     0.17814    0.22672       0.2988
Balanced Accuracy       0.7329     0.56384    0.66304       0.7307

Decision Tree

Multi-class area under the curve: 0.8218

Confusion Matrix and Statistics

          Reference
Prediction  AD ADE4 SCD SCDE4
     AD    580  195   2     4
     ADE4  241  121   4    23
     SCD    24   16 277   537
     SCDE4  15    8 117   306

Overall Statistics
                                          
               Accuracy : 0.5198          
                 95% CI : (0.4999, 0.5397)
    No Information Rate : 0.3522          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3586          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: AD Class: ADE4 Class: SCD Class: SCDE4
Sensitivity             0.6744     0.35588     0.6925       0.3517
Specificity             0.8752     0.87418     0.7213       0.9125
Pos Pred Value          0.7426     0.31105     0.3244       0.6861
Neg Pred Value          0.8342     0.89476     0.9239       0.7213
Precision               0.7426     0.31105     0.3244       0.6861
Recall                  0.6744     0.35588     0.6925       0.3517
F1                      0.7069     0.33196     0.4418       0.4650
Prevalence              0.3482     0.13765     0.1619       0.3522
Detection Rate          0.2348     0.04899     0.1121       0.1239
Detection Prevalence    0.3162     0.15749     0.3457       0.1806
Balanced Accuracy       0.7748     0.61503     0.7069       0.6321

XGBoost Forest

Multi-class area under the curve: 0.8392

Confusion Matrix and Statistics

          Reference
Prediction  AD ADE4 SCD SCDE4
     AD    646  202   2     0
     ADE4  203  125   1    15
     SCD     4    1 162   288
     SCDE4   7   12 235   567

Overall Statistics
                                          
               Accuracy : 0.6073          
                 95% CI : (0.5877, 0.6266)
    No Information Rate : 0.3522          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.4501          
                                          
 Mcnemar's Test P-Value : 0.03747         

Statistics by Class:

                     Class: AD Class: ADE4 Class: SCD Class: SCDE4
Sensitivity             0.7512     0.36765    0.40500       0.6517
Specificity             0.8733     0.89718    0.85845       0.8413
Pos Pred Value          0.7600     0.36337    0.35604       0.6906
Neg Pred Value          0.8679     0.89887    0.88189       0.8163
Precision               0.7600     0.36337    0.35604       0.6906
Recall                  0.7512     0.36765    0.40500       0.6517
F1                      0.7556     0.36550    0.37895       0.6706
Prevalence              0.3482     0.13765    0.16194       0.3522
Detection Rate          0.2615     0.05061    0.06559       0.2296
Detection Prevalence    0.3441     0.13927    0.18421       0.3324
Balanced Accuracy       0.8122     0.63242    0.63173       0.7465

Model Comparison

Observations:

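  1. Adding serum metabolite information (either the full 230-metabolite matrix or its 6-factor projection) increases discriminatory power relative to the clinical-only benchmark: the multi-class AUC rises from 0.814 to 0.837 with all 230 metabolites and to 0.821 with the 6 factors, and accuracy rises from 0.479 to 0.584 and 0.540, respectively.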

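  2. Fitting 6 ML-estimated factors obtained with the FMradio package (cumulatively explaining 30% of the variance) improves classification over the clinical-only benchmark (AUC 0.821 vs. 0.814), though by less than the full 230-metabolite model (AUC 0.837). This suggests that factor projection is a useful dimension-reduction technique for high-dimensional data.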

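  3. Looking at the confusion matrices and individual ROC curves, all models discriminated better between some class pairs (AD+E4/SCD+E4, AD+E4/SCD, AD-E4/SCD+E4 and AD-E4/SCD-E4) than between others (AD+E4/AD-E4 and SCD+E4/SCD-E4). In other words, the models separate AD from SCD well but struggle to separate ApoE4 carriers from non-carriers within the same diagnosis.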
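
Observation 3 can be illustrated by collapsing a 4-class confusion matrix along each of its two underlying labels. The sketch below does this for the XGBoost matrix reported above, assuming that the `E4` suffix in a class name marks ApoE4 carriers; the helper `collapsed_accuracy` and the variable names are illustrative and not part of the original analysis.

```python
# XGBoost confusion matrix copied from the report
# (rows: prediction, columns: reference).
labels = ["AD", "ADE4", "SCD", "SCDE4"]
xgb = [
    [646, 202,   2,   0],
    [203, 125,   1,  15],
    [  4,   1, 162, 288],
    [  7,  12, 235, 567],
]


def collapsed_accuracy(cm, group_of):
    """Share of cases whose predicted and reference classes fall in the same group.

    group_of[i] gives the coarser group of class i (e.g. its diagnosis).
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    agree = sum(
        cm[i][j]
        for i in range(k)
        for j in range(k)
        if group_of[i] == group_of[j]
    )
    return agree / n


# Assumed grouping: diagnosis (AD vs. SCD) and ApoE4 carrier status.
diagnosis = ["AD", "AD", "SCD", "SCD"]
carrier = [False, True, False, True]

diagnosis_acc = collapsed_accuracy(xgb, diagnosis)
carrier_acc = collapsed_accuracy(xgb, carrier)
print(f"Diagnosis (AD vs. SCD) accuracy:  {diagnosis_acc:.3f}")
print(f"ApoE4 carrier status accuracy:    {carrier_acc:.3f}")
```

For this matrix, almost all errors occur within a diagnosis: the diagnosis is recovered for about 98% of cases, whereas carrier status is recovered for about 62%.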

Model                                   AUC
Clinical features only                  0.8139584
Clinical features + 230 metabolites     0.8364857
Clinical features + 6 latent factors    0.8204809
Decision Tree                           0.8217917
XGBoost                                 0.8392302